Raft's Log Replication Protocol

Learn how log replication works in Raft.

A log comprises one or more log entries, and each entry holds an executable application-specific command. The basic responsibility of Raft is to maintain a consistent sequential log across all servers. Let’s see how Raft achieves this.

Log replication

After a leader is chosen in an election, it handles client requests. These requests contain instructions to be executed by the state machines. The leader adds each instruction to its log as a new entry and sends AppendEntries RPCs to all other servers in parallel to replicate the entry. Once the entry has been replicated safely (that is, once a majority of the servers have stored it), the leader applies it to its own state machine and returns the result of the execution to the client.

If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

  1. A client sends a request (containing a state machine command) to the leader.
  2. The leader appends the command to its log and sends AppendEntries RPCs to all the followers.
  3. The leader receives replies from the servers and achieves consensus once a majority has replicated the entry.
  4. The leader commits the entry in its log, and its local state machine executes the command. The result of the command is sent to the client.
  5. All the followers commit the same log entry at the same index in their logs.
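The leader's side of the first two steps can be sketched as a Start-style method that assigns the new entry its index and term (a minimal sketch; the `Leader` struct and its fields are illustrative, and the replication loop itself is assumed to run separately):

```go
package main

// LogEntry pairs an application-specific command with the term
// in which the leader received it.
type LogEntry struct {
    Term int
    Data interface{}
}

// Leader holds just enough state for this sketch.
type Leader struct {
    currentTerm int
    log         []LogEntry // index 0 is an unused sentinel entry
}

// Start appends a client command to the leader's log and returns
// the index and term assigned to the new entry. A background
// replication loop would then send the entry to the followers
// via AppendEntries RPCs.
func (l *Leader) Start(command interface{}) (index, term int) {
    l.log = append(l.log, LogEntry{Term: l.currentTerm, Data: command})
    return len(l.log) - 1, l.currentTerm
}
```

Once a majority of the servers store the entry at the returned index, the leader can commit it and reply to the client.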

Every record in the log contains an instruction for the state machine and the term number corresponding to when the leader received that instruction. The term numbers recorded in the log are used to detect inconsistencies between logs, which we’ll discuss later in this lesson. Each entry is also assigned a unique integer index indicating its position within the log.

A single log entry

The following illustration shows a sample log organization in a Raft group:

The log is made up of individual entries in sequential order. An entry is considered committed when a majority of the servers hold it in their logs; at that point, it is safe to apply the entry to the state machines. Different colors depict different terms; with each new term, a different node might have become the leader. The leader is placed at the top in the illustration above.

The leader is responsible for deciding when it is safe to apply a log entry to the state machines; this is known as committing an entry. Raft guarantees that committed entries are durable and will eventually be executed by all available state machines. A log entry is committed once the leader that created it has replicated it on a majority of the servers. Committing an entry also commits all preceding entries in the leader’s log, including entries created by earlier leaders. (We’ll elaborate on this point in the next lesson.) The leader keeps track of the highest index it knows to be committed and includes it in future AppendEntries RPCs, including heartbeats, so that the other servers eventually find out. When a follower learns that a log entry has been committed, it applies the entry to its local state machine in log order.

The following Go sketch shows how a leader might drive log replication (simplified; fields such as nextIndex and matchIndex follow the paper's terminology, and persistence details are omitted):

// LogEntry represents a single entry in the Raft log
type LogEntry struct {
    Term int         // Term in which the entry was created
    Data interface{} // Application-specific command/data
}


func (rf *Raft) logReplication() {
    for {
        rf.mu.Lock()
        for i := range rf.peers {
            // Skip sending log entries to self
            if i == rf.me {
                continue
            }

            // Send AppendEntries RPC to follower
            go func(peer int) {
                rf.mu.Lock()
                if rf.state != Leader {
                    rf.mu.Unlock()
                    return
                }

                // Prepare the AppendEntries arguments while still
                // holding the lock, so currentTerm, commitIndex, and
                // the log are read consistently.
                prevLogIndex := rf.nextIndex[peer] - 1
                prevLogTerm := 0
                if prevLogIndex > 0 {
                    prevLogTerm = rf.log[prevLogIndex].Term
                }
                entries := make([]LogEntry, len(rf.log[rf.nextIndex[peer]:]))
                copy(entries, rf.log[rf.nextIndex[peer]:])
                args := AppendEntriesArgs{
                    Term:         rf.currentTerm,
                    LeaderID:     rf.me,
                    PrevLogIndex: prevLogIndex,
                    PrevLogTerm:  prevLogTerm,
                    Entries:      entries,
                    LeaderCommit: rf.commitIndex,
                }
                rf.mu.Unlock()

                // Send the AppendEntries RPC without holding the lock
                var reply AppendEntriesReply
                if rf.sendAppendEntries(peer, &args, &reply) {
                    rf.mu.Lock()
                    defer rf.mu.Unlock()

                    // Update nextIndex and matchIndex
                    if reply.Success {
                        rf.nextIndex[peer] = args.PrevLogIndex + len(args.Entries) + 1
                        rf.matchIndex[peer] = rf.nextIndex[peer] - 1

                        // Update commitIndex if majority has replicated log entries
                        for n := rf.commitIndex + 1; n < len(rf.log); n++ {
                            if rf.log[n].Term != rf.currentTerm {
                                continue
                            }
                            count := 1
                            for i := range rf.peers {
                                if i == rf.me {
                                    continue
                                }
                                if rf.matchIndex[i] >= n {
                                    count++
                                }
                            }
                            if count > len(rf.peers)/2 {
                                rf.commitIndex = n
                            }
                        }
                    } else {
                        // A newer term in the reply means this server
                        // is no longer a legitimate leader: step down.
                        if reply.Term > rf.currentTerm {
                            rf.currentTerm = reply.Term
                            rf.state = Follower
                            rf.votedFor = -1
                            rf.persist()
                            return
                        }
                        // Consistency check failed: back nextIndex up
                        // (never below 1) and retry on the next pass.
                        if rf.nextIndex[peer] > 1 {
                            rf.nextIndex[peer]--
                        }
                    }
                }
            }(i)
        }
        rf.mu.Unlock()

        // Sleep to prevent busy looping
        time.Sleep(50 * time.Millisecond)
    }
}

The Log Matching Property

Raft’s log mechanism is designed to keep the logs on different servers consistent. This simplifies the system’s behavior, makes it more predictable, and is crucial for safety. Raft maintains the Log Matching Property, which consists of the following two guarantees:

  • If two entries with the same index and term appear in different logs, they must contain the same command.
  • If two entries in separate logs have the same index and term, then the logs are identical in all preceding entries.

The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry with the same index and term in its own log, it rejects the new entries. The consistency check acts as an induction step: the initially empty logs trivially satisfy the Log Matching Property, and the check preserves the property whenever a log is extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own up through the new entries.
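The follower's side of this consistency check can be sketched as follows (a simplified, single-call view; the type and function names are illustrative, and term comparisons, heartbeats, and persistence are omitted):

```go
package main

type LogEntry struct {
    Term int
    Data interface{}
}

type AppendArgs struct {
    PrevLogIndex int
    PrevLogTerm  int
    Entries      []LogEntry
}

// handleAppendEntries applies the Log Matching consistency check:
// the follower accepts new entries only if its log contains an
// entry at PrevLogIndex whose term equals PrevLogTerm. On success
// it deletes any conflicting suffix and appends the new entries.
// Index 0 is a sentinel on which the logs always agree.
func handleAppendEntries(log []LogEntry, args AppendArgs) ([]LogEntry, bool) {
    // Reject if our log is too short to contain PrevLogIndex.
    if args.PrevLogIndex >= len(log) {
        return log, false
    }
    // Reject if the entry at PrevLogIndex has a different term.
    if args.PrevLogIndex > 0 && log[args.PrevLogIndex].Term != args.PrevLogTerm {
        return log, false
    }
    // Truncate any conflicting entries and append the new ones.
    log = append(log[:args.PrevLogIndex+1], args.Entries...)
    return log, true
}
```

On rejection, the leader would back up and retry with an earlier PrevLogIndex, which is exactly the retry loop described in the consistency check section below.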

Inconsistent logs

In normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check always succeeds. However, leader crashes can leave the logs inconsistent, because the old leader may not have fully replicated all the entries in its log. These inconsistencies can compound over a series of leader and follower crashes. A follower’s log may be missing entries that are present on the leader, it may have extra entries that the leader lacks, or both; the missing and extraneous entries may span multiple terms.

The following illustration depicts some probable scenarios of inconsistent logs. Each box represents a single log entry with the number corresponding to the term number written inside it:

  1. The follower is missing a single entry in its log: the entry at index 10 for term 6. Moreover, no entry was committed for term 3 because of a drawn election in which no candidate won.
  2. The follower is missing multiple entries, from index 6 to index 10. The missing entries come from two different terms, term 5 and term 6.
  3. The follower has an extra uncommitted entry from term 6 at index 11. Since the leader hasn’t committed this entry, the follower may have received it from a crashed leader that replicated it but couldn’t commit it.
  4. The follower has multiple extra uncommitted entries from term 7, at indexes 11 and 12. The follower received these entries from a crashed leader that had a higher term number than the current leader. The current leader has committed entries from term 6, but the follower has entries from term 7 in its log.
  5. The follower simultaneously has extra uncommitted entries from term 4, at indexes 6 and 7, and is missing entries from term 6, at indexes 8, 9, and 10.
  6. This scenario can occur if the server led term 2, added entries to its log at indexes 4, 5, and 6, and crashed before committing any of them; quickly restarted, became leader for term 3, and added more entries at indexes 7 through 11; and then, before any entries from term 2 or term 3 were committed, crashed again and stayed unavailable for several terms.

Consistency check

To make a follower’s log consistent with its own, the leader must find the latest entry on which the two logs agree, delete any entries in the follower’s log after that point, and then send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs.

The leader maintains a nextIndex for each follower, which specifies the index of the next log entry that the leader will send to that follower. When a leader assumes power, it initializes all nextIndex values to the index immediately following the last one in its log. When the logs of a follower and the leader do not match, the consistency check for AppendEntries will fail in the next AppendEntries RPC. After rejection, the leader reduces nextIndex and repeatedly attempts the AppendEntries RPC. Eventually, nextIndex will reach a point where the leader and follower logs agree. When this happens, AppendEntries succeeds, eliminating any conflicting entries in the follower’s log and appending entries from the leader’s log (if any). When AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.

  1. When a leader assumes power, it initializes the nextIndex value for every follower to one greater than the index of the last entry in its own log.
  2. The leader sends an AppendEntries RPC, which carries the consistency check.
  3. Since the follower’s log doesn’t match the leader’s at that point, the consistency check fails.
  4. After the rejection, the leader decrements nextIndex.
  5. The leader sends another AppendEntries RPC to retry the consistency check.
  6. The logs still don’t match at the new position, so the check fails again.
  7. The leader decrements nextIndex once more.
  8. The leader retries the AppendEntries RPC.
  9. The check fails again.
  10. The leader decrements nextIndex again.
  11. The leader retries the AppendEntries RPC.
  12. Now that the entry preceding nextIndex matches the last entry in the follower’s log, the consistency check succeeds.
  13. The leader then replicates the missing entries to the follower, and the logs are consistent.
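The retry loop in this walkthrough can be simulated in isolation. The helper below is hypothetical: logs are represented as slices of term numbers, with index 0 as a sentinel on which the logs always agree.

```go
package main

// findNextIndex simulates the leader's per-follower retry loop:
// starting from the index just past the leader's last entry, it
// decrements nextIndex after each failed consistency check until
// the entry preceding nextIndex has the same term in both logs.
func findNextIndex(leaderTerms, followerTerms []int) int {
    nextIndex := len(leaderTerms) // one past the leader's last entry
    for {
        prev := nextIndex - 1
        // The check passes if prev is the sentinel or the follower
        // has an entry at prev with the same term as the leader's.
        if prev == 0 || (prev < len(followerTerms) && followerTerms[prev] == leaderTerms[prev]) {
            return nextIndex
        }
        nextIndex--
    }
}
```

For the missing-entries scenario, the loop walks back one index per rejected RPC until it reaches the last entry the follower shares with the leader; replication then resumes from the returned index.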

Note: Here, we applied the consistency check mechanism to the scenario in which the follower’s log was missing multiple entries compared to the leader’s log. We encourage learners to walk through the mechanism for the other scenarios discussed earlier as well.

Point to ponder

Question

How can we optimize the consistency check mechanism?


If needed, the algorithm can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when a follower rejects an AppendEntries request, it can include the term of the conflicting entry and the last index it stores for that term. With this information, the leader can decrease nextIndex enough to skip all the conflicting entries in that term. This way, only one AppendEntries RPC is needed per term with conflicting entries, rather than one RPC per entry.

Referring to the above illustration:

  1. The leader assumes power and initializes nextIndex to the index just past its own log, index 11 in the illustration.
  2. The leader sends an AppendEntries RPC whose consistency check tests whether the follower’s log matches at the entry preceding nextIndex.
  3. The follower rejects the AppendEntries RPC and replies with the term of its conflicting entry, term 6, and the last index it stores for that term, index 8.
  4. The leader skips all the conflicting intermediate entries and sets nextIndex to one past the index the follower returned.

However, this optimization may not be necessary as failures are rare and inconsistent entries are unlikely to occur frequently in practice.
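Under the scheme described in this answer, the follower's hint and the leader's reaction can be sketched as follows (field and function names are illustrative; real implementations vary in the exact hints they return):

```go
package main

// Reply carries the fast-backup hints a follower can return on
// rejection: the term of its conflicting entry and the last
// index it stores for that term.
type Reply struct {
    Success       bool
    ConflictTerm  int
    ConflictIndex int
}

// rejectWithHint builds the rejection for a follower whose log is
// shorter than the leader's prevLogIndex: it reports the term of
// its last entry and that entry's index. Logs are represented as
// slices of term numbers, with a sentinel at index 0.
func rejectWithHint(followerTerms []int) Reply {
    last := len(followerTerms) - 1
    return Reply{Success: false, ConflictTerm: followerTerms[last], ConflictIndex: last}
}

// nextIndexAfterReject lets the leader skip all the entries in the
// conflicting term in one step instead of decrementing one by one.
func nextIndexAfterReject(reply Reply) int {
    return reply.ConflictIndex + 1
}
```

In the illustrated scenario, the follower reports term 6 and index 8, and the leader jumps nextIndex to 9 with a single rejected RPC.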

Using this mechanism, a leader doesn’t have to take special measures to ensure log consistency upon regaining power. The leader can resume normal operation, and the logs will automatically converge in response to failures during the AppendEntries consistency check. The leader always adheres to the Leader Append-Only Property and never alters or deletes entries in its own log.

Point to ponder

Question

Why is the Leader Append-Only Property important, and what are the potential risks of not following it?


Raft’s Leader Append-Only Property states that a leader only ever appends entries to its log; it never deletes or overwrites them. This property is crucial for ensuring data consistency and safety across servers.

  • Data inconsistency: If a leader deleted or overwrote entries already in its log, those entries could differ from the copies on other servers, producing inconsistent logs.
  • Data loss: Violating the property can also cause data loss. If a leader alters or deletes log entries and then crashes before the changes reach other servers, that data is lost.

Raft can accept, replicate, and apply new log entries as long as a majority of the servers are functioning. Under normal operation, with no failures, a new entry can be replicated with a single round of RPCs to a majority of the cluster (one set of RPCs from the leader to the followers and one set of replies back). Hence, a single slow follower need not impact the system’s performance.
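The quorum arithmetic behind this guarantee fits in one line (a hypothetical helper, not part of any standard API):

```go
package main

// quorum returns the minimum number of servers (including the
// leader) that must store an entry before it can be committed:
// a strict majority of the cluster.
func quorum(clusterSize int) int {
    return clusterSize/2 + 1
}
```

In a five-server cluster, for example, the leader can commit once two followers acknowledge, so up to two slow or failed servers cannot stall progress.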

In the next lesson, we’ll learn some additional protocols of the Raft consensus algorithm covering safety, fault tolerance, and availability.
